According to Worldometer, China is the most populated country in the world. Chinese characters (hànzì) have also been the basis of writing in other languages, such as Japanese (kanji), Korean (hanja), and Vietnamese (chữ Hán). Nevertheless, few people outside China understand Chinese characters, mainly because of their complexity. Therefore, in this project let’s build a machine learning model that can recognize Chinese characters, specifically Chinese digits. The data consists of images of handwritten Chinese characters for 0 to 10, 100, 1000, 10000 (ten thousand), and 100000000 (a hundred million).
A digital image consists of pixels, or picture elements. According to the Cambridge Dictionary, a pixel is the smallest unit of an image on a digital platform. Each pixel contains a value, called the pixel value, ranging from 0 to 255 and describing the pixel’s brightness (for grayscale images) or color strength (for colored images).
The size of an image depends on the number of pixels it has. For example, an image of size \(1000 \times 1500\) has a width of 1000 pixels and a height of 1500 pixels, for a total of \(1000 \times 1500 = 1500000\) (one and a half million) pixels.
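As a quick sanity check, the arithmetic above can be reproduced in R (the dimensions are the example numbers, not our dataset’s):

```r
# total number of pixels in a 1000 x 1500 image
width  <- 1000
height <- 1500
total_pixels <- width * height
total_pixels
# a grayscale pixel value is a single number from 0 (black) to 255 (white)
```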
Let’s import the required libraries. In this project, we will use the keras library to build our machine learning model, dplyr for data wrangling, and caret to split the data and make a confusion matrix.
library(keras)
library(dplyr)
library(caret)

The data we are going to use are taken from Kaggle: the Chinese Digit Recognizer as the main dataset, and Chinese MNIST as the indexing dataset containing the labels.
chinese_mnist <- read.csv("data-input/chineseMNIST.csv") # main dataset
chinese_meta <- read.csv("data-input/chinese_mnist.csv") # index (contains labels)
# combine the metadata to the main dataset
chinese_mnist$label <- chinese_meta$code
chinese_mnist$value <- chinese_meta$value
# rearrange the dataset so "label", "value", and "character" are the first three columns
chinese_mnist <- chinese_mnist %>%
select(label, value, character, everything())

Let’s inspect the data using the head() function.

head(chinese_mnist)

Explanation:
label = numerical code for the target variable
value = the actual numeric value of each character
character = the Chinese number character, in Unicode
pixel_0, …, pixel_4096 = predictors, as pixel values
Let’s check if we have missing values in our dataset.
sum(is.na(chinese_mnist))

## [1] 0
We have no missing values, sweet!
Now let’s check the number of data we have for each character.
table(chinese_mnist$character)

##
## 一 七 万 三 九 二 五 亿 八 六 十 千 四 百 零
## 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
The result above shows that there is no class imbalance, great!
Next, let’s count the number of predictors that we have.
dim(chinese_mnist[-c(1:3)])

## [1] 15000  4096

The number 15000 shows that we have 15000 entries (rows) in our dataset, while 4096 shows that we have 4096 predictor columns, one per pixel in an image. Each image is a square of \(64\times64\) pixels, and \(64\times64=4096\) (you can verify this yourself!).
Let’s take a peek at 36 random entries of our dataset using vizTrain, a function made by Samuel Chan!
vizTrain <- function(input){
dimmax <- sqrt(ncol(input[,-c(1:3)]))
dimn <- ceiling(sqrt(nrow(input)))
par(mfrow=c(dimn, dimn), mar=c(.1, .1, .1, .1))
for (i in 1:nrow(input)){
m1 <- as.matrix(input[i, 4:4099]) # pixel columns only
dim(m1) <- c(64, 64)
m1 <- apply(apply(m1, 1, rev), 1, t) # reorient the matrix for plotting with image()
image(1:64, 1:64,
m1, col=grey.colors(255),
# remove axis text
xaxt = 'n', yaxt = 'n')
text(25, 10, col="white", cex=1.2, input[i, 2])
}
}
vizTrain(sample_n(chinese_mnist, 36))

Let’s take a look at the unique values from the label column.

sort(unique(chinese_mnist$label))

## [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
We can see that the labels start from 1. This can cause a dimension-mismatch error since we are using Keras, a library built for Python, where indexing starts from 0: to_categorical() treats labels as zero-based class indices. To avoid that error, let’s shift the labels so they start from 0.
chinese_mnist <- chinese_mnist %>%
  mutate(label = label - 1) # all labels are >= 1, so a plain shift suffices
sort(unique(chinese_mnist$label))

## [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
To validate the performance of our model, let’s split the dataset into training, validation, and testing sets. This way, we can see how our machine learning model performs when dealing with unseen data. We will split the data into 80% training, 10% validation, and 10% testing.
set.seed(100)
train_index <- createDataPartition(as.factor(chinese_mnist$label), p = 0.8, list = FALSE)
data_train <- chinese_mnist[ train_index,]
data_test <- chinese_mnist[-train_index,]
set.seed(100)
test_index <- createDataPartition(as.factor(data_test$label), p = 0.5, list = FALSE)
data_val <- data_test[ test_index,]
data_test <- data_test[-test_index,]

Remember the theory about pixels mentioned at the beginning? It is computationally expensive to work with values ranging from 0 to 255, so it is wise to scale the data to the range 0 to 1. This is called min-max scaling, and here it can be done by dividing every value in our data by the maximum possible value, 255.
To ease further data processing, let’s separate the predictors and labels and put them into new variables.
data_train_x <- data_train %>%
select(-c(label,character,value)) %>% # take only the predictors
as.matrix()/255 # change the data type into matrix and do min-max scaling
data_train_y <- data_train$label # take only the labels
data_val_x <- data_val %>%
select(-c(label,character,value)) %>%
as.matrix()/255
data_val_y <- data_val$label
data_test_x <- data_test %>%
select(-c(label,character,value)) %>%
as.matrix()/255
data_test_y <- data_test$label

Keras only accepts predictors in the form of arrays and labels in the form of one-hot-encoded matrices. One-hot encoding means that we give each class its own “binary” code.
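To see what one-hot encoding produces, here is a tiny base-R sketch that mirrors what keras::to_categorical() does (the labels below are toy values, not our dataset’s):

```r
# toy zero-based class labels
labels <- c(0, 1, 2, 1)
# each row of the identity matrix is the "binary" code for one class
onehot <- diag(3)[labels + 1, ]
onehot
# e.g. label 1 becomes the row c(0, 1, 0)
```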
Let’s change our predictors into arrays and do one-hot encoding for our labels.
# Change predictors to arrays
train_x <- array_reshape(data_train_x, dim=dim(data_train_x))
val_x <- array_reshape(data_val_x, dim=dim(data_val_x))
test_x <- array_reshape(data_test_x, dim=dim(data_test_x))
# One-hot encoding target variable
train_y <- to_categorical(data_train_y)
val_y <- to_categorical(data_val_y)
test_y <- to_categorical(data_test_y)

In this project, we are going to use a Deep Neural Network (DNN). According to IBM, a DNN is a neural network with more than three layers, including the input and output layers.
To build a DNN with Keras, there are two major steps.
Make a sequential model
To make a sequential model, we use the keras_model_sequential() function.
Make the neural network layers
The most basic layer type is the dense layer, created with the layer_dense() function. This function has several parameters:
input_shape : the shape of our predictors, only used for the first hidden layer
units : the number of neurons in the layer
activation : the activation function used by the layer
name : (optional) the name of the layer
Note: for the last layer (output layer), units has to equal the number of target classes.
To simplify parameter assignment when building the deep learning model, let’s store our data dimensions in new variables.
input_dim <- ncol(train_x) # dimension of predictors
num_class <- n_distinct(data_train$label)

Let’s build a Deep Neural Network (DNN) with only two hidden layers, with 64 and 32 neurons/nodes for the first and second hidden layers respectively. Since ReLU outputs only non-negative values, let’s use the ReLU activation function for the hidden layers, and softmax for the output layer since we are dealing with a multiclass classification case.
model1 <- keras_model_sequential() %>%
# input layer + first hidden layer
layer_dense(input_shape = input_dim, # dimension of predictors
units = 64, # number of neurons/nodes
activation = "relu", # activation function
name = "hidden_1") %>%
# Dense layer
layer_dense(units = 32,
activation = "relu") %>% # to produce non-negative values
# output layer
layer_dense(units = num_class, # num. of target classes
activation = "softmax", # for multiclass classification case
name = "output")

Since we are working with a multiclass classification case, we will use categorical cross-entropy as our loss function. Let’s use the Adam optimizer since it’s one of the most widely used optimizers. According to Jason Brownlee, PhD, the founder of Machine Learning Mastery,
The Adam optimization algorithm is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.
If you are interested, you can read more about Adam optimizer straight from its inventors.
model1 %>%
compile(loss = loss_categorical_crossentropy(),
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "accuracy")

Time for the model to learn! Let’s use the fit() function and pass epochs = 15 so the model iterates over the training data 15 times, updating its weights after every batch of 1000 entries with batch_size = 1000. To track accuracy on unseen data, don’t forget to pass our validation set to the validation_data parameter.
history <- model1 %>%
fit(x = train_x,
y = train_y,
epochs = 15,
validation_data = list(val_x, val_y),
batch_size = 1000)
plot(history)

Now let’s predict the labels of our testing dataset using the predict() function.
pred <- predict(model1, test_x) %>%
k_argmax() %>% # take the highest probability value
as.array() %>%
as.factor()

Last step: evaluation! We can use the confusionMatrix() function from the caret library to produce a confusion matrix and an accuracy value.

confusionMatrix(data = pred, reference = as.factor(data_test$label))

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 0 86 0 0 0 2 0 0 0 0 0 1 9 2 4 0
## 1 0 100 6 0 0 0 1 0 1 0 0 0 0 0 0
## 2 0 0 80 10 1 1 1 0 0 0 0 0 0 0 0
## 3 0 0 11 84 0 2 5 0 0 1 0 0 1 0 0
## 4 0 0 0 0 83 0 2 0 0 2 1 0 1 2 1
## 5 0 0 1 3 0 85 4 5 0 5 0 5 2 0 2
## 6 0 0 0 3 0 1 66 5 4 9 0 2 1 0 4
## 7 0 0 0 0 3 3 1 81 1 3 1 3 1 3 3
## 8 0 0 0 0 1 0 2 0 85 1 1 0 0 1 2
## 9 0 0 0 0 1 3 7 4 1 66 0 2 3 9 10
## 10 0 0 1 0 1 1 0 1 0 0 90 0 20 1 0
## 11 12 0 0 0 5 2 3 1 1 3 0 67 0 17 1
## 12 1 0 1 0 0 1 5 1 1 0 5 2 69 2 0
## 13 1 0 0 0 1 1 2 0 2 3 1 9 0 61 0
## 14 0 0 0 0 2 0 1 2 4 7 0 1 0 0 77
##
## Overall Statistics
##
## Accuracy : 0.7867
## 95% CI : (0.7651, 0.8072)
## No Information Rate : 0.0667
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.7714
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.86000 1.00000 0.80000 0.84000 0.83000 0.85000
## Specificity 0.98714 0.99429 0.99071 0.98571 0.99357 0.98071
## Pos Pred Value 0.82692 0.92593 0.86022 0.80769 0.90217 0.75893
## Neg Pred Value 0.98997 1.00000 0.98579 0.98854 0.98793 0.98919
## Prevalence 0.06667 0.06667 0.06667 0.06667 0.06667 0.06667
## Detection Rate 0.05733 0.06667 0.05333 0.05600 0.05533 0.05667
## Detection Prevalence 0.06933 0.07200 0.06200 0.06933 0.06133 0.07467
## Balanced Accuracy 0.92357 0.99714 0.89536 0.91286 0.91179 0.91536
## Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity 0.66000 0.81000 0.85000 0.66000 0.90000 0.67000
## Specificity 0.97929 0.98429 0.99429 0.97143 0.98214 0.96786
## Pos Pred Value 0.69474 0.78641 0.91398 0.62264 0.78261 0.59821
## Neg Pred Value 0.97580 0.98640 0.98934 0.97561 0.99278 0.97622
## Prevalence 0.06667 0.06667 0.06667 0.06667 0.06667 0.06667
## Detection Rate 0.04400 0.05400 0.05667 0.04400 0.06000 0.04467
## Detection Prevalence 0.06333 0.06867 0.06200 0.07067 0.07667 0.07467
## Balanced Accuracy 0.81964 0.89714 0.92214 0.81571 0.94107 0.81893
## Class: 12 Class: 13 Class: 14
## Sensitivity 0.69000 0.61000 0.77000
## Specificity 0.98643 0.98571 0.98786
## Pos Pred Value 0.78409 0.75309 0.81915
## Neg Pred Value 0.97805 0.97252 0.98364
## Prevalence 0.06667 0.06667 0.06667
## Detection Rate 0.04600 0.04067 0.05133
## Detection Prevalence 0.05867 0.05400 0.06267
## Balanced Accuracy 0.83821 0.79786 0.87893
Seems like the accuracy is still quite low. Let’s improve our model by adding convolutional layers to our Deep Neural Network.
A Convolutional Neural Network (CNN) is a type of neural network that uses convolution layers. CNNs are highly popular for image data, as stated by IBM,
Convolutional neural networks are distinguished from other neural networks by their superior performance with image, speech, or audio signal inputs.
A CNN typically consists of these layers.
Convolutional layer
A convolution layer works by applying a filter to the image: the pixel values covered by the filter are multiplied element-wise with the values in the filter matrix and summed. The filter then slides across the image until every pixel has been covered.
Here is an animated illustration of how a convolutional layer works.
A convolution layer essentially extracts features from your image, such as horizontal lines, vertical lines, and edges. Our machine learns these features during training and tries to identify these patterns when given unseen data.
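To make the multiply-and-sum operation concrete, here is a minimal sketch of a convolution in R; conv2d and the toy image/filter values are made up for illustration (no padding, stride 1):

```r
# A minimal sketch of a "valid" 2D convolution (no padding, stride 1).
conv2d <- function(img, filt) {
  fr <- nrow(filt); fc <- ncol(filt)
  out <- matrix(0, nrow(img) - fr + 1, ncol(img) - fc + 1)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      # element-wise multiply the covered patch with the filter, then sum
      out[i, j] <- sum(img[i:(i + fr - 1), j:(j + fc - 1)] * filt)
    }
  }
  out
}

img  <- matrix(1:16, nrow = 4)            # toy 4 x 4 "image"
filt <- matrix(c(1, 0, 0, -1), nrow = 2)  # toy 2 x 2 filter
conv2d(img, filt)                         # produces a 3 x 3 feature map
```

Note how the output shrinks from 4×4 to 3×3: a filter of size \(k\) reduces each dimension by \(k-1\) when no padding is used.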
Pooling layer
A pooling layer is used to decrease computational effort without losing important information. It works similarly to a convolutional layer, except that the pooling filter does not perform multiplication; instead, it applies an aggregation function to the captured patch. There are two types of pooling:
Max pooling
Works by taking the maximum value of the captured patch
Average pooling
Works by taking the average value of the captured patch
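The aggregation idea can be sketched in R; pool2d and the toy matrix are illustrative only (2×2 windows, stride 2):

```r
# A minimal sketch of 2 x 2 pooling with stride 2 (toy values).
pool2d <- function(img, size = 2, fun = max) {
  out <- matrix(0, nrow(img) / size, ncol(img) / size)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      rows <- ((i - 1) * size + 1):(i * size)
      cols <- ((j - 1) * size + 1):(j * size)
      out[i, j] <- fun(img[rows, cols])  # aggregate the captured patch
    }
  }
  out
}

m <- matrix(c(1, 3, 2, 4,
              5, 7, 6, 8,
              1, 1, 2, 2,
              3, 3, 4, 4), nrow = 4, byrow = TRUE)
pool2d(m, fun = max)   # max pooling: halves each dimension
pool2d(m, fun = mean)  # average pooling
```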
Flatten layer
As the name suggests, a flatten layer flattens the data from a matrix into one long vector so that a dense layer can process it. The layers processing the data after the flatten layer are usually called fully-connected layers.
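A flatten operation is just a reshape; here is a toy sketch in R (keras handles the element ordering internally, so this only illustrates the idea):

```r
# a toy 2 x 2 feature map
fmap <- matrix(c(7, 3, 8, 4), nrow = 2)
flat <- as.vector(fmap) # one long vector, ready for a dense layer
flat
```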
Here is a diagram that shows how the layers mentioned above are connected.
To use a convolution layer, we must go back to the step where we changed the data type. The cell below looks almost identical to the one in the Change the Data Type section.
# Change predictors to arrays
train_x <- array_reshape(data_train_x, dim=c(dim(data_train_x)[1],64,64,1))
val_x <- array_reshape(data_val_x, dim=c(dim(data_val_x)[1],64,64,1))
test_x <- array_reshape(data_test_x, dim=c(dim(data_test_x)[1],64,64,1))
# One-hot encoding target variable
train_y <- to_categorical(data_train_y)
val_y <- to_categorical(data_val_y)
test_y <- to_categorical(data_test_y)

Notice the difference? Correct! This time we change the dim parameter inside the array_reshape() function. You can see the difference better in the raw code below.
# Without convolution layer
train_x <- array_reshape(data_train_x, dim=dim(data_train_x))
val_x <- array_reshape(data_val_x, dim=dim(data_val_x))
test_x <- array_reshape(data_test_x, dim=dim(data_test_x))
# With convolution layer
train_x <- array_reshape(data_train_x, dim=c(dim(data_train_x)[1],64,64,1))
val_x <- array_reshape(data_val_x, dim=c(dim(data_val_x)[1],64,64,1))
test_x <- array_reshape(data_test_x, dim=c(dim(data_test_x)[1],64,64,1))
Now what does c(dim(data_train_x)[1],64,64,1) mean?
Explanation:
dim(data_train_x)[1] = the number of rows (entries) in the data_train_x dataset
64,64 = the dimensions of the image
1 = the number of color channel(s): 1 for grayscale and 3 for RGB
Let’s reassign the data-size variables.

input_dim <- ncol(train_x)
num_class <- n_distinct(data_train$label)

To make a convolution layer, we can use the layer_conv_2d() function. Don’t forget to fill the input_shape parameter for the first layer of our network.
This time we pass the value c(64,64,1) which means our data
is a two-dimensional array with the size of \(64\times64\) and one channel.
model2 <- keras_model_sequential() %>%
# Convolutional layer
layer_conv_2d(input_shape = c(64,64,1),
filters = 16,
kernel_size = c(3,3), # 3 x 3 filters
activation = "relu") %>%
# Max pooling layer
layer_max_pooling_2d(pool_size = c(2,2)) %>%
# Flattening layer
layer_flatten() %>%
# Dense layer
layer_dense(units = 32,
activation = "relu") %>%
# output layer
layer_dense(units = num_class,
activation = "softmax",
name = "output")

Let’s keep everything else the same for the next steps 😃
model2 %>%
compile(loss = loss_categorical_crossentropy(),
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "accuracy")

history <- model2 %>%
  fit(x = train_x,
      y = train_y,
      epochs = 15,
      validation_data = list(val_x, val_y),
      batch_size = 1000)

plot(history)

Let’s make predictions on the unseen test data using our CNN model.
pred <- predict(model2, test_x) %>%
k_argmax() %>%
as.array() %>%
as.factor()

confusionMatrix(data = pred, reference = as.factor(data_test$label))

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 0 97 0 0 0 0 1 0 0 0 1 0 0 0 0 0
## 1 0 99 6 2 0 0 1 0 0 0 0 0 0 0 0
## 2 0 1 83 10 0 0 2 0 0 0 0 0 1 0 0
## 3 0 0 11 88 0 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 92 1 1 1 0 1 0 0 0 0 1
## 5 0 0 0 0 0 93 0 2 0 1 0 4 0 0 0
## 6 0 0 0 0 0 0 84 3 0 2 1 0 1 3 0
## 7 0 0 0 0 2 1 0 85 0 2 0 1 1 0 0
## 8 0 0 0 0 0 0 1 0 96 2 0 0 0 0 2
## 9 1 0 0 0 2 0 1 5 1 85 0 0 2 0 6
## 10 0 0 0 0 0 0 1 0 0 0 95 1 6 1 0
## 11 1 0 0 0 3 1 1 1 1 1 0 88 1 5 0
## 12 0 0 0 0 0 0 1 1 0 0 2 0 85 0 0
## 13 1 0 0 0 0 0 7 0 1 2 2 6 3 91 0
## 14 0 0 0 0 1 0 0 2 1 3 0 0 0 0 91
##
## Overall Statistics
##
## Accuracy : 0.9013
## 95% CI : (0.8851, 0.916)
## No Information Rate : 0.0667
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8943
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.97000 0.99000 0.83000 0.88000 0.92000 0.93000
## Specificity 0.99857 0.99357 0.99000 0.99000 0.99643 0.99500
## Pos Pred Value 0.97980 0.91667 0.85567 0.86275 0.94845 0.93000
## Neg Pred Value 0.99786 0.99928 0.98788 0.99142 0.99430 0.99500
## Prevalence 0.06667 0.06667 0.06667 0.06667 0.06667 0.06667
## Detection Rate 0.06467 0.06600 0.05533 0.05867 0.06133 0.06200
## Detection Prevalence 0.06600 0.07200 0.06467 0.06800 0.06467 0.06667
## Balanced Accuracy 0.98429 0.99179 0.91000 0.93500 0.95821 0.96250
## Class: 6 Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity 0.84000 0.85000 0.96000 0.85000 0.95000 0.88000
## Specificity 0.99286 0.99500 0.99643 0.98714 0.99357 0.98929
## Pos Pred Value 0.89362 0.92391 0.95050 0.82524 0.91346 0.85437
## Neg Pred Value 0.98862 0.98935 0.99714 0.98926 0.99642 0.99141
## Prevalence 0.06667 0.06667 0.06667 0.06667 0.06667 0.06667
## Detection Rate 0.05600 0.05667 0.06400 0.05667 0.06333 0.05867
## Detection Prevalence 0.06267 0.06133 0.06733 0.06867 0.06933 0.06867
## Balanced Accuracy 0.91643 0.92250 0.97821 0.91857 0.97179 0.93464
## Class: 12 Class: 13 Class: 14
## Sensitivity 0.85000 0.91000 0.91000
## Specificity 0.99714 0.98429 0.99500
## Pos Pred Value 0.95506 0.80531 0.92857
## Neg Pred Value 0.98937 0.99351 0.99358
## Prevalence 0.06667 0.06667 0.06667
## Detection Rate 0.05667 0.06067 0.06067
## Detection Prevalence 0.05933 0.07533 0.06533
## Balanced Accuracy 0.92357 0.94714 0.95250
Fantastic, the accuracy of our model just went up by about 12 percentage points!
Due to their complexity, Chinese characters are not easy to learn. Yet, in this project, we have successfully made our machine learn from images of handwritten Chinese digits with a Deep Neural Network and a Convolutional Neural Network. With an ordinary DNN (using only dense layers), we achieved a test accuracy of around 78%, while with a CNN we achieved 90% test accuracy. This supports the claim, stated in the references above, that CNNs improve learning from image data. A further application of CNNs to image data is image recognition: digit and letter recognition, for example, can be developed into translating the text in an image.